
The "Grep" Command But Not FusionMap, FusionFinder 
or ChimeraScan Captures the CIC-DUX4 Fusion Gene 
from Whole Transcriptome Sequencing Data on a Small 
Round Cell Tumor with t(4;19)(q35;q13) 

Ioannis Panagopoulos1,2*, Ludmila Gorunova1,2, Bodil Bjerkehagen3, Sverre Heim1,2,4 

1 Section for Cancer Cytogenetics, Institute for Cancer Genetics and Informatics, The Norwegian Radium Hospital, Oslo University Hospital, Oslo, Norway, 2 Centre for 
Cancer Biomedicine, Faculty of Medicine, University of Oslo, Oslo, Norway, 3 Department of Pathology, The Norwegian Radium Hospital, Oslo University Hospital, Oslo, 
Norway, 4 Faculty of Medicine, University of Oslo, Oslo, Norway 



Abstract 

Whole transcriptome sequencing was used to study a small round cell tumor in which a t(4;19)(q35;q13) was part of the 
complex karyotype but where the initial reverse transcriptase PCR (RT-PCR) examination did not detect a CIC-DUX4 fusion 
transcript previously described as the crucial gene-level outcome of this specific translocation. The RNA sequencing data 
were analysed using the FusionMap, FusionFinder, and ChimeraScan programs, which are specifically designed to identify 
fusion genes. FusionMap, FusionFinder, and ChimeraScan identified 1017, 102, and 101 fusion transcripts, respectively, but 
CIC-DUX4 was not among them. Since the RNA sequencing data are in the fastq text-based format, we searched the files 
using the "grep" command-line utility. The "grep" command searches the text for specific expressions and displays, by 
default, the lines where matches occur. The "specific expression" was a sequence of 20 nucleotides from the coding part of 
the last exon (exon 20) of CIC (Reference Sequence: NM_015125.3) chosen since all the so far reported CIC breakpoints have 
occurred here. Fifteen chimeric CIC-DUX4 cDNA sequences were captured and the fusion between the CIC and DUX4 genes 
was mapped precisely. New primer combinations were constructed based on these findings and were used together with a 
polymerase suitable for amplification of GC-rich DNA templates to amplify CIC-DUX4 cDNA fragments which had the same 
fusion point found with "grep". In conclusion, FusionMap, FusionFinder, and ChimeraScan generated a plethora of fusion 
transcripts but did not detect the biologically important CIC-DUX4 chimeric transcript; they are generally useful but 
evidently imperfect in both sensitivity and specificity. The "grep" command is an excellent tool to capture chimeric 
transcripts from RNA sequencing data when the pathological and/or cytogenetic information strongly indicates the 
presence of a specific fusion gene. 
Introduction 

The translocation t(4;19)(q35;q13) was described by Richkind et 
al [1] as the sole chromosomal aberration in a tumor diagnosed as 
poorly differentiated extraskeletal mesenchymal sarcoma in a 
12-year-old boy. The authors mentioned that a similar translocation 
had also been reported as part of a complex karyotype in an 
embryonal rhabdomyosarcoma (RMS) cell line [2] and as part of a 
three-way translocation t(4;19;12)(q35;q13.1;q13) in an 
undifferentiated/embryonal RMS [3] and suggested that it might be a 
recurrent chromosomal aberration in malignant primitive 
mesenchymal stem cells [1]. Sommers et al [4] described a subcutaneous 
primitive neuroectodermal tumor/Ewing sarcoma without 
EWSR1 rearrangement but with a complex karyotype containing 
a t(4;19)(q33~35;q13). Kawamura-Saito et al [5] described two 
cases of Ewing-like sarcoma which had a t(4;19)(q35;q13) in their 
karyotypes. They also showed that the translocation resulted in 
fusion of the capicua transcriptional repressor CIC gene on 19q13, 
which codes for a high mobility group box transcription factor, 
with the double homeodomain DUX4 gene on 4q35 [5]. 

DUX4 is located within a D4Z4 repeat array in the subtelomeric 
region of chromosome arm 4q [6]. A similar D4Z4 repeat array 
has been identified on chromosome 10 [7]. Each D4Z4 repeat unit 
has an open reading frame (named DUX4) that encodes two 
homeoboxes [6]. There is no evidence for transcription of this 
gene from standard cDNA libraries, but RT-PCR and in vitro 
expression experiments indicate that the ORF is transcribed [8,9]. 
The encoded protein is located in the nucleus, induces cell death, 
and has been reported to function as a transcriptional activator of 
paired-like homeodomain transcription factor 1 (PITX1) [8,9]. So 
far, there are roughly 20 reported cases of sarcoma with the 
t(4;19)(q35;q13) and/or CIC-DUX4 fusion [1-5,10-15]. In seven 
other cases with CIC-DUX4, the DUX4 gene involved in the fusion 
apparently stems from the locus on 10q26 [13,16]. The current 
data suggest that the CIC-DUX4 fusion defines a subgroup of 
primitive round cell sarcomas, different from Ewing sarcoma, with 
distinctive histopathology and rapid disease progression [1-5,10-15]. 

Recently, whole transcriptome sequencing (RNA-Seq, RNA 
sequencing) was shown to be an efficient tool in the detection of 
fusion genes in cancer [17]. In short, extracted RNA from cancer 
cells is massively sequenced, and then the raw data are analyzed 
with one or more programs specifically dedicated to the task of 
detecting fusion transcripts such as ChimeraScan [18], FusionMap 
[19], and FusionFinder [20]. However, the programs typically 
identify numerous fusion transcripts making the assessment of 
which of them are important and which are noise extremely 
difficult. To overcome this challenge, we and others have used 
combinations of cytogenetics and RNA-Seq to detect the 
"primary" fusion genes of neoplasms carrying only one or a few 
chromosomal rearrangements. A number of fusion genes were 
found using this approach, among them the recurrent ZC3H7-BCOR 
in endometrial stromal sarcomas [21], IRF2BP2-CDX1 in a 
mesenchymal chondrosarcoma [22], and EWSR1-YY1 in a subset 
of mesotheliomas [23]. In the present study, we performed whole 
transcriptome sequencing to study a small round cell tumor in 
which t(4;19)(q35;q13) was part of a complex karyotype. While the 
fusion gene detection programs ChimeraScan [18], FusionMap 
[19], and FusionFinder [20] failed to detect the CIC-DUX4 
fusion transcript, the "grep" command-line utility captured the 
cytogenetically indicated CIC-DUX4 fusion gene. 

Materials and Methods 

Ethics Statement 

The study was approved by the regional ethics committee 
(Regional komité for medisinsk forskningsetikk Sør-Øst, Norge, 
http://helseforskning.etikkom.no). Written informed consent was 
obtained from the patient prior to her death. The ethics committee 
approval included a review of the consent procedure and all 
patient information has been anonymized and de-identified. 

Patient 

A 40-year-old female presented with pain in the lower part of 
the thoracic wall and imaging showed a tumor in thoracic skeletal 
muscle with extension into the retroperitoneum and costae. The 
histological diagnosis was small round cell sarcoma (Figure 1). 
Immunohistochemistry demonstrated positive findings for vimentin, 
AE1/AE3, and CD99, but was negative for WT1, CD56, 
synaptophysin, chromogranin, MYF4, SMA, desmin, CD3, CD20, 
CD45, CD79a, TdT, S100, and FLI1. RT-PCR did not show 
gene fusion consistent with Ewing sarcoma (EWSR1-ERG/FLI1) or 
synovial sarcoma (SS18-SSX1, 2 or 4). The patient received 
preoperative chemotherapy and the resected specimen disclosed a 
12 cm large tumor. The patient later developed lung metastasis 
and a local recurrence and died of sarcoma 10 months after the 
diagnosis. 

Chromosome banding analysis and fluorescence in situ 
hybridization (FISH) 

A sample from the surgically removed tumor was mechanically 
and enzymatically disaggregated and short-term cultured as 
described elsewhere [24]. The cultures were harvested and the 
chromosomes G-banded using Wright stain. The subsequent 
cytogenetic analysis and karyotype description followed the 
recommendations of the ISCN [25]. 

The BAC clone RP11-556K23 (chr19:47422736-47630224), 
which maps to 19q13.2 and contains the CIC gene, was retrieved 




Figure 1. Pathologic examination of the tumor. A) The 12 cm 
large tumour was localized in the skeletal muscle in the thoracic wall 
with extension to the retroperitoneum and costae. B) HE-stained slides 
showed a small round cell tumour. C) Immunoexpression of CD99. 
doi:10.1371/journal.pone.0099439.g001 

from the Human genome high-resolution BAC re-arrayed clone 
set (the "32k set"; BACPAC Resources, http://bacpac.chori.org/pHumanMinSet.htm). 
Mapping data for the 32k human re-array 
are available in an interactive web format (http://bacpac.chori.org/pHumanMinSet.htm, 
from the genomic rearrays page) and 
can be obtained by activation of the UCSC browser track for the 
hg17 UCSC assembly from the "32k set" homepage 
(http://bacpac.chori.org/genomicRearrays.php). FISH mapping of the 
clone was performed on normal controls to confirm its 
chromosomal location. DNA was extracted and probes were 
labelled and hybridized according to Abbott Molecular 
recommendations (http://www.abbottmolecular.com/home.html). 
Chromosome preparations were counterstained with 0.2 µg/ml DAPI and 
overlaid with a 24×50 mm² coverslip. Fluorescent signals were 
captured and analyzed using the CytoVision system (Leica Biosystems, 
Newcastle, UK). 

High-throughput paired-end RNA-sequencing 

Tumor tissue adjacent to that used for cytogenetic analysis and 
histologic examination had been frozen and stored at −80°C. 
Total RNA was extracted from the tumor using Trizol reagent 
according to the manufacturer's instructions (Invitrogen, Oslo, 
Norway) and its quality was checked by Experion Automated 
Electrophoresis System (Bio-Rad Laboratories, Oslo, Norway). 
Three µg of total RNA from the primary tumor were sent for 
high-throughput paired-end RNA-sequencing at the Genomics Core 
Facility, The Norwegian Radium Hospital (http://genomics.no/Oslo/). 
The RNA was sequenced using an Illumina HiSeq 2500 
instrument and the Illumina software pipeline was used to process 
image data into raw sequencing data. Only sequence reads 
marked as "passed filtering" were used in the downstream data 
analysis. A total of 100 million reads were obtained. The software packages 
FusionMap (http://www.omicsoft.com/fusionmap/) [19], FusionFinder 
(http://bioinformatics.childhealthresearch.org.au/software/fusionfinder/) [20], 
and ChimeraScan (https://code.google.com/p/chimerascan/) [18] 
were used for the discovery of fusion transcripts. 
In addition, the "grep" command (http://en.wikipedia.org/wiki/Grep) 
was used to search the fastq files of the sequence data 
(http://en.wikipedia.org/wiki/FASTQ_format) for CIC sequence 
(NM_015125 version 3). 

FusionMap was run on a PC with Windows XP Professional 
as the operating system. FusionFinder, ChimeraScan, and the "grep" 
command were run on a PC with Bio-Linux 7 as the operating 
system [26]. 
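
To illustrate why a plain text search works on fastq files, the following is a minimal sketch and not the authors' actual command line. On a Linux system the search could plausibly be as simple as `grep GCCGCCTTCCAGGCCCGCTA filename.fastq`; the Python below mimics the default behaviour of "grep" (report every line containing the search term). The function name `grep_lines` and the two demo reads are hypothetical, and the search term is one of the 20-nt CIC terms reported in the Results.

```python
import io

SEARCH_TERM = "GCCGCCTTCCAGGCCCGCTA"  # 20-nt CIC term quoted in the Results


def grep_lines(handle, pattern):
    """Yield each line of a text stream that contains `pattern` (fixed string)."""
    for line in handle:
        if pattern in line:
            yield line.rstrip("\n")


# Hypothetical two-read FASTQ held in memory; these are not reads from the study.
READ1 = "TT" + SEARCH_TERM + "TGCAGAC"
READ2 = "ACGTACGTACGTACGTACGTACGTACGTA"
demo_fastq = io.StringIO(
    f"@read1\n{READ1}\n+\n{'I' * len(READ1)}\n"
    f"@read2\n{READ2}\n+\n{'I' * len(READ2)}\n"
)

hits = list(grep_lines(demo_fastq, SEARCH_TERM))
print(hits)  # only the sequence line of read1 contains the search term
```

For a real file, the in-memory stream would be replaced by `open("filename.fastq")`. As with "grep" itself, only the matching sequence line is returned, not the read header or quality string.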

PCR 

The primers used for PCR amplification and sequencing are 
listed in Table 1. 

One µg of tumor total RNA was reverse-transcribed in a 20 µL 
reaction volume using iScript Advanced cDNA Synthesis Kit for 
RT-qPCR according to the manufacturer's instructions (Bio-Rad 
Laboratories, Oslo, Norway). Initially, the 25 µL PCR volume 
contained 12.5 µL of Premix Taq (Takara Bio Europe/SAS, 
Saint-Germain-en-Laye, France), 1 µL of the synthesized cDNA, 
and 0.4 µM of each of the forward CIC-4105F and reverse 



Table 1. Primers used for PCR amplifications and sequencing. 

Oligo Name     Sequence (5'->3')
CIC-4105F      CGAAGAGCGCTTTGCTGAGTTGCC
CIC-4283F      AGAAGACGCTCCAGCTGCAGCTCG
CIC-4377F      CCGAGGACGTGCTTGGGGAGCTA
CIC-4453F      GGCCCTGGTCATGCAGCTCTTTCA
CIC-4856R      CTCAGGGGTCCCTCACCTGCCTGT
CIC-4958R      CCCAAACTGGAGAGGACGAAATGGC
DUX4-1053R     ACCGAGGAGCCTGAGGGTGGGAG
DUX4-1151R     CTTGAGCGGGCCCAGGCTGTG
DUX4-1507R     CTTCCAGCGAGGCGGCCTCTTC
DUX4-1538R     GCAGAGCCCGGTATTCTTCCTCGC

doi:10.1371/journal.pone.0099439.t001 
DUX4-1538R primers. One µL of the first PCR amplification was 
used as template in a nested PCR with the forward CIC-4283F 
and reverse DUX4-1507R primers. To assess the quality of the cDNA 
synthesis, the primers CIC-4238F and CIC-4958R were used to 
amplify a CIC cDNA fragment. The PCRs were run on a C-1000 
Thermal cycler (Bio-Rad Laboratories) with the following cycling 
conditions: an initial denaturation at 94°C for 30 sec followed by 
35 cycles of 7 sec at 98°C and 2 min at 68°C, and a final extension 
for 5 min at 68°C. 

In subsequent PCR amplifications, PrimeSTAR GXL DNA 
polymerase was used (Takara Bio). According to the company's 
information this is a high fidelity polymerase suitable for GC-rich 
templates that are otherwise difficult to amplify. The 25 µL PCR 
volume contained 1× PrimeSTAR GXL Buffer (Takara Bio), 
1 µL of the synthesized cDNA, 200 µM of each dNTP, 0.4 µM of 
each of the forward primer CIC-4377F and the reverse primer 
DUX4-1151R or 0.4 µM of each of the primers CIC-4453F and 
DUX4-1053R. The PCR was run on a C-1000 Thermal cycler 
(Bio-Rad Laboratories) with an initial denaturation at 94°C for 
30 sec, followed by 35 cycles of 7 sec at 98°C, 2 min at 68°C, and 
a final extension for 5 min at 68°C. Three µL of the PCR products 
were stained with GelRed (Biotium, Hayward, CA, USA), 
analyzed by electrophoresis through 1.0% agarose gel, and 
photographed. 

The rest of the amplified PCR products were purified using the 
NucleoSpin Gel and PCR Clean-up kit (Macherey-Nagel, 
VWR International, Oslo, Norway). Direct sequencing (Sanger 
sequencing) was performed using the light run sequencing service 
of GATC Biotech (http://www.gatc-biotech.com/en/sanger-services/lightrun-sequencing.html). 
The BLAST software (http://blast.ncbi.nlm.nih.gov/Blast.cgi) 
was used for computer analysis of 
sequence data. The nucleotide sequence has been deposited in the 
GenBank with accession number KJ670706. 

Results 

G-banding analysis yielded the diagnostic karyotype 
46,XX,del(2)(q13q23),t(4;19)(q35;q13),ins(11;?)(q11;?),der(20)?t(20;20)(p11;q11)[14]/46,XX[3] 
(Figure 2A). When metaphase spreads (Figure 2B) 
were hybridized with the BAC RP11-556K23, one split signal 
was seen, indicating that the translocation breakpoint on 
chromosome 19 was within the BAC (Figure 2C). This clone 
contains, apart from CIC, the genes GSK3A, ERF, PAFAH1B3, 
PRR19, TMEM145, MEGF8, CNFN, and LIPE (Figure 2D). 




Figure 2. Cytogenetic and FISH analyses of the tumor. A) Karyogram showing chromosome aberrations del(2)(q13q23), t(4;19)(q35;q13), 
ins(11;?)(q11;?), and der(20)?t(20;20)(p11;q11); breakpoints are indicated by arrows. B) FISH performed on metaphase spread using BAC RP556K23 
(green signal) from 19q13 containing the CIC gene. A part of this probe has moved to the derivative chromosome 4. The der(4), der(19), and the 
normal chromosomes 4 and 19 are indicated by arrows. C) G-banding of the metaphase spread shown in (B). The der(4), der(19) and the normal 
chromosomes 4 and 19 are indicated by arrows. D) The location of the BAC RP556K23 on chromosome 19 and the genes found in this region. The 
data were obtained from the UCSC Genome Browser (http://genome.ucsc.edu/). 
doi:10.1371/journal.pone.0099439.g002 



The initial PCR with Premix Taq and the primer set 
CIC-4105F/DUX4-1538R as well as the nested PCR with the primers 
CIC-4283F/DUX4-1507R failed to amplify any cDNA fragments. 
However, the primer set CIC-4238F/CIC-4958R amplified 
a CIC cDNA fragment suggesting that the synthesized cDNA 
was of good quality (Figure 3A). Because of the negative RT-PCR 
results, whole transcriptome sequencing was performed and the 
sequencing data were analyzed with FusionMap, FusionFinder, 
and ChimeraScan, which are programs designed to detect fusion 
genes from high-throughput sequencing data [18,19,20]. 



FusionMap identified 1024 potential fusion transcripts (Table 
S1) but CIC-DUX4 was not among them. Neither GSK3A, ERF, 
PAFAH1B3, PRR19, TMEM145, MEGF8, CNFN, nor LIPE, the 
other genes which are localized on the FISH probe, were found to 
be partners in the detected fusion transcripts. FusionFinder and 
ChimeraScan identified 103 and 101 potential fusion transcripts, 
respectively (Tables S2 and S3), but again CIC-DUX4 was not 
among them. Neither GSK3A, ERF, PAFAH1B3, PRR19, 
TMEM145, MEGF8, CNFN nor LIPE, the other genes within 
the BAC, were found to be partners in the detected fusion 
transcripts. 
Figure 3. RT-PCR results for the expression of CIC-DUX4 in the tumor. A) The initial PCR with Premix Taq and the primer set CIC-4105F/ 
DUX4-1538R (lane 1) as well as the nested PCR with the primers CIC-4283F/DUX4-1507R (lane 2) did not amplify any cDNA fragments. The primer set 
CIC-4238F/CIC-4958R (lane 3) amplified a CIC cDNA fragment suggesting the good quality of the synthesized cDNA. Lane 4, Blank, no RNA in cDNA 
synthesis. B) PCR amplifications using the PrimeSTAR GXL DNA polymerase and the primer combinations CIC-4377F/DUX4-1151R (lane 5) and 
CIC-4453F/DUX4-1053R (lane 6). Lane 7, Blank, no RNA in cDNA synthesis. M, 1 Kb DNA ladder (GeneRuler, Fermentas). C) Partial sequence chromatogram 
of the amplified cDNA fragment showing that CIC is fused to DUX4. 
doi:10.1371/journal.pone.0099439.g003 



Since fastq is a text-based format of the sequence data, we 
decided to use the "grep" command-line utility and search for 
sequences which contained part of the last exon of CIC (exon 20, 
nucleotides 4500-5473 in the sequence with accession number 
NM_015125 version 3). The search terms were 
"GCCGCCTTCCAGGCCCGCTA" (nt 4511-4530) and 
"CAGGGGGCCCTGACCCCACC" (nt 4701-4720). The first 
search term extracted 76 sequences containing CIC cDNA 
fragments (data not shown). The second search term extracted 
22 sequences. Blasting of each of these sequences with the human 
genomic plus transcript database (http://blast.ncbi.nlm.nih.gov/Blast.cgi), 
CIC mRNA reference sequence NM_015125.3, and 
DUX4 mRNA reference sequence NM_033178.4 showed that 15 
out of the 22 were chimeric CIC-DUX4 cDNA fragments (Table 2). 
The fusion had occurred between nt 4724 of CIC mRNA reference 
sequence NM_015125.3 and nt 771 of DUX4 mRNA reference 
sequence NM_033178.4. Using the search term 
"CCCACCTCACCGGCAGAGGG", which is composed of 10 nt of CIC 
(CCCACCTCAC) and 10 nt (CGGCAGAGGG) of DUX4 
upstream and downstream of the fusion point, 19 sequences were 
retrieved, 15 of which were those found with the 
"CAGGGGGCCCTGACCCCACC" search term. 

To verify the data obtained with the "grep" command, PCR 
amplifications were performed using the PrimeSTAR GXL DNA 
polymerase. Both primer combinations, CIC-4377F/DUX4- 
1151R and CIC-4453F/DUX4-1053R, amplified cDNA frag- 
ments (Figure 3B). Sanger sequencing verified that they were CIC- 
DUX4 fusion transcripts which had the same fusion point found 
with the "grep" command (Figure 3C). 
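
The relationship between the second search term and the junction-spanning term can be checked with a short sketch. The sequences and coordinates are taken from the text above; the function names `overlap_len` and `classify`, the labels they return, and the example reads in the usage note are hypothetical. The sketch shows that the second term (nt 4701-4720) ends 4 nt before the reported fusion point (nt 4724), so a read matching it must extend at least 4 nt further into CIC, and then into DUX4, to be recognizable as chimeric.

```python
CIC_SIDE = "CCCACCTCAC"         # last 10 nt of CIC before the fusion point (from the text)
DUX4_SIDE = "CGGCAGAGGG"        # first 10 nt of DUX4 after the fusion point (from the text)
GREP2 = "CAGGGGGCCCTGACCCCACC"  # second search term, CIC nt 4701-4720 (from the text)
GREP2_END = 4720                # last CIC nucleotide covered by GREP2
FUSION_CIC_NT = 4724            # last CIC nucleotide before DUX4

junction = CIC_SIDE + DUX4_SIDE  # the 20-mer spanning the fusion point


def overlap_len(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for n in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def classify(read):
    """Label a read by which of the two search terms it contains."""
    if junction in read:
        return "spans fusion point"
    if GREP2 in read:
        return "CIC term only"
    return "no match"


shared = overlap_len(GREP2, CIC_SIDE)
extra_cic_nt = len(CIC_SIDE) - shared
print(junction)
print(shared, extra_cic_nt, FUSION_CIC_NT - GREP2_END)
```

As a usage example under the same assumptions, `classify(GREP2 + "TCAC" + DUX4_SIDE)` labels a made-up read that continues across the fusion point, whereas a made-up read that stops right after the second search term is labelled as containing the CIC term only. This is one way to see why a junction-spanning term can retrieve reads that the CIC-only term misses.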

Discussion 

Our initial negative result for CIC-DUX4 fusion with RT-PCR 
prompted us to investigate the tumor using whole transcriptome 
sequencing. The small round cell tumor had the t(4;19)(q35;q13) 
translocation as part of its karyotype and in addition a split signal 
of the BAC RP11-556K23 (mapped on 19q13), which contains 
CIC, features that led us to nevertheless believe strongly that a 
CIC-DUX4 fusion must be present. However, also GSK3A, ERF, 
CIC-4105F 

1 |cgaagagcgctttgctgagttgcc| tgagtttcggcctgaggaggtgctgccctcccccacc 

EERFAELPEFRPEEVLPSPT 
62 ctgcagtctctggccacctcaccccgggccatcctgggctcttaccgcaagaagaggaag 

LQSLATSPRAILGSYRKKR K 

122 aactccacggacctggattcagcacccgaggaccccacctcgcccaagcgcaag :: |aga| 
N S T D L D S A P E D P T S P K R K R 
CIC-4283F-> 

182 |agacgctccagctgcagctcg| gagcccaacacccccaagagtgccaagtgcgagggggac 
RRSSCSSEPNTPKSAKCEGD 

CIC-4377F-> 

242 atcttcacctttgaccgtacaggtacagaag |ccgaggacgtgcttggggagcta| gagtat 
IFTFDRTGTEAEDVLGELEY 

CIC-4453F-> 

302 gacaaggtgccatactcctccctgcggcgcaccctggaccagcgccg lggccctggtcatg] 

DKVPYSSLRRTLDQRRALVM 

CIC-4453F-> "grep" 1 

362 |cagctctttca| ggaccatggcttcttcccgtcagcccaggccaca |gccgccttccaggcc| 

QLFQDHGFFPSAQATAAFQA 

422 |cgcta| tgcagacatctttccctccaaggtttgtctgcagttgaagatccgtgaggtgcgc 

RYADIFPSKVCLQLKIREVR 
482 cagaagatcatgcaggctgccactcccacggagcagccccctggagctgaggctcctctc 

QKIMQAATPTEQPPGAEA P L 
542 cctgtaccgccccccactggcaccgctgctgcccctgcccccactcccagccccg [caggg| 

PVPPPTGTAAAPAPTPSPAG 
"grep" 2 

602 |ggccctgaccccacc| tcajggcagagggggtctcccaacctgccccggcgcgcggggat 
GPDPTSPAEGVSQPAPARGD 

<-DUX4-1053R 

662 ttcgcctacgccgccccggctcctccggacggggcgct |ctcccaccctcaggctcctcgg| 

FAYAAPAPPDGALSHPQAPR 
722 [^ggcctccgcacccgggcaaaagccgggaggaccgggacccgcagcgcgacggcctgccg 
WPPHPGKSREDRDPQRDGLP 

<-DUX4-1151R 

782 ggcccctgcgcggtgg |cacagcctgggcccgctcaag| cggggccgcagggccaaggggtg 

GPCAVAQPGPAQAGPQGQGV 
842 cttgcgccacccacgtcccaggggagtccgtggtggggctggggccggggtccccaggtc 

LAPPTSQGSPWWGWGRGPQV 
902 gccggggcggcgtgggaaccccaagccggggcagctccacctccccagcccgcgcccccg 

AGAAWEPQAGAAPPPQPAPP 
962 gacgcctccgcctccgcgcggcaggggcagatgcaaggcatcccggcgccctcccaggcg 
DASASARQGQMQGIPAPSQA 
1022 ctccaggagccggcgccctggtctgcactcccctgcggcctgctgctggatgagctcctg 

LQEPAPWSALPCGLLLDELL 
1082 gcgagcccggagtttctgcagcaggcgcaacctctcctagaaacggaggccccgggggag 
ASPEFLQQAQPLLETEAPGE 

<-DUX4-1507R <-DUX4-1538R 

1142 ctggaggcctcg |gaagaggccgcctcgctggaag| cacccctca |gcgaggaagaataccgg| 

L E A SEEAASLEAPLSEEEYR 
1202 |gctctgc| 1208 
A L 

Figure 4. A putative 1208 bp CIC-DUX4 fusion transcript which would have been amplified using the forward CIC-4105F and 
reverse DUX4-1538R primers. All the primers used in the study are shown: the primer sequences are boxed and their orientation is indicated by arrows. 
The search sequences "GCCGCCTTCCAGGCCCGCTA" ("grep" 1) and "CAGGGGGCCCTGACCCCACC" ("grep" 2) used as search terms in the "grep" 
command-line utility are colored yellow and in box. The fusion point between CIC and DUX4 is in red. The part of the protein coded by this CIC-DUX4 
fusion transcript fragment is shown under the nucleotide sequence. The nucleotide sequence has been deposited in the GenBank with accession 
number KJ670706. 

doi:10.1371/journal.pone.0099439.g004 



PAFAH1B3, PRR19, TMEM145, MEGF8, CNFN, and LIPE were 
present in the BAC bridging the breakpoint and could conceivably 
be the gene-level target of the chromosomal split. It was therefore 
surprising that no signs of any CIC-DUX4 were evident when we 
analyzed the raw sequencing data using ChimeraScan [18], 
FusionMap [19], and FusionFinder [20], fusion-finder programs 
that have all been evaluated recently on a synthetic dataset as well 
as real datasets that included experimentally validated chimeras 
[27,28]. All three programs produced a plethora of fusion 
transcripts but none of them contained CIC or any of the other 
8 genes found in the split RP11-556K23 FISH probe. We then as 
a last resort decided to search for CIC sequences in the whole 
transcriptome sequencing data set using the "grep" command-line 
utility. The rationale was: 1) the RNA sequencing data are in fastq 
format files (filename.fastq) and fastq is a text-based format 
(http://en.wikipedia.org/wiki/FASTQ_format) and 2) the 
sequence data can be searched using the "grep" command-line 
utility (http://en.wikipedia.org/wiki/Grep). The "grep" 
command-line utility is used for searching text or a file for specific 
expressions. By default, "grep" displays the lines where matches 
occur. Our "specific expression" was a sequence of 20 nucleotides 
from the coding part of the last exon (20) of CIC (Reference 
Sequence: NM_015125.3) since all the so far reported CIC 
breakpoints have occurred in that part of the CIC gene [5,12-14]. 
The sequences obtained by "grep" were blasted against the 
human genomic plus transcript database (http://blast.ncbi.nlm.nih.gov/Blast.cgi) 
in order to identify possible chimeric fragments 
containing part of CIC and part of another gene. 
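
The text does not say how the retrieved reads were passed to BLAST. One plausible route, sketched here purely as an assumption, is to turn the matching sequence lines into FASTA records that can be pasted into the BLAST web form. The function name `to_fasta`, the record prefix `grep_hit`, and the example sequences are hypothetical.

```python
def to_fasta(sequences, prefix="grep_hit", width=60):
    """Format sequences as FASTA text with numbered, made-up record names."""
    records = []
    for i, seq in enumerate(sequences, start=1):
        # Wrap long sequences at `width` characters per line.
        wrapped = "\n".join(seq[j:j + width] for j in range(0, len(seq), width))
        records.append(f">{prefix}_{i}\n{wrapped}")
    return "\n".join(records) + "\n"


# Made-up sequences standing in for lines retrieved by the search.
example_hits = ["CAGGGGGCCCTGACCCCACCTCACCGGCAGAGGG", "ACGTACGT"]
print(to_fasta(example_hits), end="")
```

Each record gets its own header line, so hits remain distinguishable in the BLAST output; the numbering simply follows the order in which the lines were retrieved.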

This approach allowed us to obtain from the RNA sequencing 
fastq file 15 chimeric CIC-DUX4 cDNA sequences (Table 2) and to 
map the fusion between the CIC and DUX4 genes precisely. 
Subsequently, four more chimeric CIC-DUX4 sequences were 
identified using a 20-mer sequence containing the fusion point as 
"specific expression" in the "grep" command-line utility. The 
fusion occurred between nt 4724 of CIC mRNA reference 
sequence NM_015125.3 and nt 771 of DUX4 mRNA reference 
sequence NM_033178.4. This fusion has not been reported before 
[5,12-14]. CIC fusions have been reported at nt 4552, 4579, 4740, 
4750 [12-14,16] and for DUX4 at nt 1071, 1078, and 1145 of the 
reference sequence with accession number NM_033178.4 [5,12-14]. 

An explanation for the failure of the initial PCR is that the 
target CIC-DUX4 chimeric sequence between the CIC-4105F/DUX4-1538R 
primers was 1208 bp long with 70% GC content (Figure 4). 
The primer combinations CIC-4377F/DUX4-1151R and 
CIC-4453F/DUX4-1053R together with a PrimeSTAR GXL DNA 
polymerase, suitable for GC-rich templates, amplified fragments 
546 bp long with 70% GC content and 374 bp long with 69% GC 
content, respectively (Figures 3B and 4). Sanger sequencing 
verified that they were CIC-DUX4 fusion transcripts which had 
the same fusion point found with the "grep" command-line utility. 
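
The GC percentages quoted above are simple base counts. As a minimal sketch, assuming only the standard definition of GC content (the fraction of G and C bases), the same calculation can be applied to any sequence, for example the primers in Table 1. The function name `gc_fraction` is hypothetical, and the sketch is not the tool the authors used.

```python
def gc_fraction(seq):
    """Return the fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


# Two primers copied from Table 1; any sequence could be substituted.
primers = {
    "CIC-4377F": "CCGAGGACGTGCTTGGGGAGCTA",
    "DUX4-1151R": "CTTGAGCGGGCCCAGGCTGTG",
}
for name, seq in primers.items():
    print(f"{name}: {len(seq)} nt, GC = {gc_fraction(seq):.0%}")
```

Applied to the full amplicon sequence (for instance the one deposited under accession number KJ670706), the same function would give the kind of figure reported in the text.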